DNA methylation (DNAme) is a process by which methyl groups are added to the DNA molecule. Methylation could change the activity of the DNA without changing it’s code. It serves as a critical phenomenon to several biological functions and has a role in certain disease states. Further, DNAme is a privileged candidate for epigenetic inheritance due to its plasticity and stability across mitosis and meiosis. It is one of the major mechanisms of epigenetic modification and has a fundamental influence on gene expression, genomic stability and cellular activity [ ].
With the advancement of next-generation (NGS) sequencing techniques, there are several methods for studying DNA methylation, but, few offer a better resolution of methylation status such as bisulfite sequencing (also known as Bisulfite-Seq, BS-Seq, Methyl-Seq or WGBS). The key idea of the method is combining the power of high-throughput DNA sequencing with the treatment of DNA with sodium bisulfite. The human methylome contains around 28 million CG sites in humans. There are also non-CG methylated sites: CHG or CHH, where H represents adenine, thymine, or cytosine.
Whole Genome Bisulphite Sequencing (WGBS) is a gold standard NGS technique for studying DNAme in the genome of mammals. WGBS combines sodium bisulfite conversion of the DNA sequence with high throughput DNA sequencing. The sodium bisulfite reaction converts the unmethylated cytosines (C) to uracils (U), whereas, methylated C remains unchanged. These U are further converted to thymine (T) after the polymerase chain reaction (PCR). Although, no change occurs for methylated C. Methylcytosines are the methylated versions of the cytosine bases. This is done by transferring a methyl group onto the C5 position of the cytosine.
The aim of NGS-based DNA methylation analysis is to investigate genomic DNA and find out whether single cytosines or entire regions in the genome are methylated or not. WGBS is the most comprehensive sequencing for DNA methylation profiling, allowing single-base resolution of 5-mC within the whole genome. By comparing treated and untreated sequences, the location of the methylated cytosines is determined.
There are several applications of WGBS, including (not limited to): - detection of differentially methylated regions (DMRs) - detection of differentially methylated loci (DMLs) - detection of copy number variations (CNVs) - detection of single nucleotide polymorphisms (SNPs) - detection of cytosine methylation levels of transcription factor binding sites (TFBSs) - detecting methylation level or distribution (globally, single nucleotide, chromosome-wide and gene-centric)
Sequencing the whole genome is generally quite expensive. Therefore, although it has been applied to large genomes such as the human genome, large numbers of individual samples are seldom sequenced. Reduced representation bisulfite sequencing (RRBS) has been developed for this reason, in which the bisulfite reaction occurs but the sequencing is limited to around 1% of the genome. This enables the sequencing of the genome of several individuals.
library(DiagrammeR)
grViz("
digraph flowchart {
graph [overlap = true, layout=dot,fontsize=12]
node [shape = box, style=filled, fillcolor=NavajoWhite, color=DarkslateGray, fontsize=12];
A; B; E; F; G;
node [shape = egg, style=filled, fillcolor=NavajoWhite, color=DarkslateGray, fontsize=12];
C; D;
A [label='Quality Check: FastQC']
B [label='Quality Control: trim_galore']
C [label='Alignment: Bismark']
D [label='Methylation extraction: Bismark']
E [label='Overview of QC: MultiQC']
F [label='Differential analysis: DMLs']
G [label='Differential analysis: DMRs']
edge[color=DarkslateGray,
penwidth=5
]
A->B
B->C[label = ' trim adapters, remove shorter reads, trim Ns']
C->D
C->E
A->E
B->E
D->F[label = ' DMLs: single basepair resolution']
D->G[label = ' DMRs: identify regions']
}
")Figure 1. Basic pipeline for WGBS data analysis.
Two data sets were used.
One, representing negative control (with no expected DMRs), contained six methylation profiling samples from normal human dendritic cells (data in GEO). The samples were artificially divided into two groups of three samples, group 1 consisting of samples GSM1565940, GSM1565944, and GSM1565948, group 2 consisting of samples GSM1565942, GSM1565946, and GSM1565950. To obtain reasonable running times, only methylation data from chromosome 18 were used.
The second set of data was created from the negative control by adding 100 simulated DMRs using simDMRS function from R package dmrseq.
The data are the same as were used in the dmrseq paper
The reported DMRs were (if possible) filtered for only those that contain at least 10 CpGs and the mean difference between the two groups is at least 0.1 (the simulated regions had effect sizes between 0.163 and 0.450).
In total 7 different methods were used for the comparison. Their main features, as well as links to the analysis files are summarized in the Table 1:
| Method | Preprocessing + statistical test used | p-value estimate given | CpG / non CpG | Implementation language | DOI | Availability | Analysis file |
|---|---|---|---|---|---|---|---|
| bsseq | local-likelihood smoothing, t-test similar statistics | no | CpG only | R | 10.1186/gb-2012-13-10-r83 | Bioconductor package bsseq | Report |
| CGmapTools | unpaired t-test | yes | CpG only | C, C++, Python | 10.1093/bioinformatics/btx595 | binary package | Report |
| defiant | Weighted Welch Expansion | yes | CpG only | Shell | 10.1186/s12859-018-2037-1 | binary package | Report |
| DMRcaller | no preprocessing, or pooling in bins, or kernel smoothing; Fisher’s exact or Score test | yes | all contexts | R | 10.1093/nar/gky602 | Bioconductor package DMRcaller | Report |
| dmrseq | smoothing; generalized least squares | yes | CpG only | R | 10.1093/biostatistics/kxy007 | Bioconductor package dmrseq | Report |
| methPipe | Fisher’s exact test | no | CpG only | C, C++ | 10.1371/journal.pone.0081148 | binary package | Report |
| methylKit | Fisher’s exact test, or logistic regression with tiling | yes | all contexts | R | 10.1186/gb-2012-13-10-r87 | Bioconductor package methylKit | Report |
Distribution of the coverage and beta values of all the samples, shown in Figure 2A and Figure 2B and Figure 3A and Figure 3B, respectively, depicts that there is not much difference between individual samples. Also, the PCA plot, Figure 4A and Figure 4B, shows that there is no clustering of the two artificially created groups.
There was no single region that would all methods overlap by at least 1bp.
Figure 5 shows one simulated DMR, generated as described in the Data section. This region has the highest delta (methylation difference) value in the list of all generated DMRs (0.447). This region was identified as DMR by 6 methods (not identified by methpipe; also not identified by DMRcaller bins method).
Similar plots for all 100 simulated DMRS are shown in this report.
We considered any overlap between the simulated and identified DMRs to be sufficient for the region to be classified as true positive.Details of computing true positives and false positives are described in this report.
As per negative control, we do not expect any true DMRs, all the identified regions are thus false positives. dmrseq and defiant methods did not find any DMRs in the negative control data, whereas others did, as shown in Figure 6 and Table 2.
Figure 7 shows that dmrseq performs better than the other methods. On x axis is the observed FDR, on y axis the power, both for different specified FDR cut-offs. The computation of FDR and power is described here.
We have also checked how well the methods control FDR, Figure 8. For this, we plotted the observed FDR (the proportion of false discoveries) and the specified FDR (cutoff after multiple testing correction by BH). We can see that dmrseq again outperforms the other methods.
BSseq and methpipe do not allow user control of FDR, and are thus not depicted in the Figureure.
More comparison plots as well as the code can be found here.
In this project, we wanted to compare the methods for identification of DMRs, which were not compared in the dmrseq paper. We used nearly all methods available, except the ones which do not have a proper statistical description or have not been updated for a long time. We found that dmrseq outperforms all the methods that we compared. This is especially important when we take into acccount that we used just 3 samples per group, which is the most one can hope for in the real experiments (due to the high cost of WGBS). But, it has to be mentioned that the data we used were the same as in the original dmrseq paper and that the simulated DMRs were created using a function which is a part of the dmrseq method. It would be nice and useful to repeat the experiment with some independently generated data.
All the methods were well documented and were easy to use for a non-experienced user, except for CGmapTools, where the documentation was not clear.
devtools::session_info()## ─ Session info ──────────────────────────────────────────────────────────
## setting value
## version R version 3.5.1 (2018-07-02)
## os Ubuntu 16.04.5 LTS
## system x86_64, linux-gnu
## ui X11
## language (EN)
## collate de_DE.UTF-8
## ctype de_DE.UTF-8
## tz Europe/Zurich
## date 2019-01-11
##
## ─ Packages ──────────────────────────────────────────────────────────────
## package * version date lib source
## assertthat 0.2.0 2017-04-11 [1] CRAN (R 3.5.1)
## backports 1.1.3 2018-12-14 [1] CRAN (R 3.5.1)
## bindr 0.1.1 2018-03-13 [1] CRAN (R 3.5.1)
## bindrcpp 0.2.2 2018-03-29 [1] CRAN (R 3.5.1)
## brew 1.0-6 2011-04-13 [1] CRAN (R 3.5.1)
## callr 3.1.1 2018-12-21 [1] CRAN (R 3.5.1)
## cli 1.0.1 2018-09-25 [1] CRAN (R 3.5.1)
## colorspace 1.3-2 2016-12-14 [1] CRAN (R 3.5.1)
## crayon 1.3.4 2017-09-16 [1] CRAN (R 3.5.1)
## desc 1.2.0 2018-05-01 [1] CRAN (R 3.5.1)
## devtools 2.0.1 2018-10-26 [1] CRAN (R 3.5.1)
## DiagrammeR * 1.0.0 2018-03-01 [1] CRAN (R 3.5.1)
## digest 0.6.18 2018-10-10 [1] CRAN (R 3.5.1)
## downloader 0.4 2015-07-09 [1] CRAN (R 3.5.1)
## dplyr 0.7.8 2018-11-10 [1] CRAN (R 3.5.1)
## evaluate 0.12 2018-10-09 [1] CRAN (R 3.5.1)
## fs 1.2.6 2018-08-23 [1] CRAN (R 3.5.1)
## ggplot2 3.1.0 2018-10-25 [1] CRAN (R 3.5.1)
## glue 1.3.0 2018-07-17 [1] CRAN (R 3.5.1)
## gridExtra 2.3 2017-09-09 [1] CRAN (R 3.5.1)
## gtable 0.2.0 2016-02-26 [1] CRAN (R 3.5.1)
## hms 0.4.2 2018-03-10 [1] CRAN (R 3.5.1)
## htmltools 0.3.6 2017-04-28 [1] CRAN (R 3.5.1)
## htmlwidgets 1.3 2018-09-30 [1] CRAN (R 3.5.1)
## igraph 1.2.2 2018-07-27 [1] CRAN (R 3.5.1)
## influenceR 0.1.0 2015-09-03 [1] CRAN (R 3.5.1)
## jsonlite 1.6 2018-12-07 [1] CRAN (R 3.5.1)
## knitr 1.21 2018-12-10 [1] CRAN (R 3.5.1)
## lazyeval 0.2.1 2017-10-29 [1] CRAN (R 3.5.1)
## magrittr 1.5 2014-11-22 [1] CRAN (R 3.5.1)
## memoise 1.1.0 2017-04-21 [1] CRAN (R 3.5.1)
## munsell 0.5.0 2018-06-12 [1] CRAN (R 3.5.1)
## pillar 1.3.1 2018-12-15 [1] CRAN (R 3.5.1)
## pkgbuild 1.0.2 2018-10-16 [1] CRAN (R 3.5.1)
## pkgconfig 2.0.2 2018-08-16 [1] CRAN (R 3.5.1)
## pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.5.1)
## plyr 1.8.4 2016-06-08 [1] CRAN (R 3.5.1)
## prettyunits 1.0.2 2015-07-13 [1] CRAN (R 3.5.1)
## processx 3.2.1 2018-12-05 [1] CRAN (R 3.5.1)
## ps 1.3.0 2018-12-21 [1] CRAN (R 3.5.1)
## purrr 0.2.5 2018-05-29 [1] CRAN (R 3.5.1)
## R6 2.3.0 2018-10-04 [1] CRAN (R 3.5.1)
## RColorBrewer 1.1-2 2014-12-07 [1] CRAN (R 3.5.1)
## Rcpp 1.0.0 2018-11-07 [1] CRAN (R 3.5.1)
## readr 1.3.1 2018-12-21 [1] CRAN (R 3.5.1)
## remotes 2.0.2 2018-10-30 [1] CRAN (R 3.5.1)
## rgexf 0.15.3 2015-03-24 [1] CRAN (R 3.5.1)
## rlang 0.3.0.1 2018-10-25 [1] CRAN (R 3.5.1)
## rmarkdown 1.11 2018-12-08 [1] CRAN (R 3.5.1)
## Rook 1.1-1 2014-10-20 [1] CRAN (R 3.5.1)
## rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.5.1)
## rstudioapi 0.8 2018-10-02 [1] CRAN (R 3.5.1)
## scales 1.0.0 2018-08-09 [1] CRAN (R 3.5.1)
## sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.5.1)
## stringi 1.2.4 2018-07-20 [1] CRAN (R 3.5.1)
## stringr 1.3.1 2018-05-10 [1] CRAN (R 3.5.1)
## testthat 2.0.1 2018-10-13 [1] CRAN (R 3.5.1)
## tibble 1.4.2 2018-01-22 [1] CRAN (R 3.5.1)
## tidyr 0.8.2 2018-10-28 [1] CRAN (R 3.5.1)
## tidyselect 0.2.5 2018-10-11 [1] CRAN (R 3.5.1)
## usethis 1.4.0 2018-08-14 [1] CRAN (R 3.5.1)
## viridis 0.5.1 2018-03-29 [1] CRAN (R 3.5.1)
## viridisLite 0.3.0 2018-02-01 [1] CRAN (R 3.5.1)
## visNetwork 2.0.5 2018-12-05 [1] CRAN (R 3.5.1)
## withr 2.1.2 2018-03-15 [1] CRAN (R 3.5.1)
## xfun 0.4 2018-10-23 [1] CRAN (R 3.5.1)
## XML 3.98-1.16 2018-08-19 [1] CRAN (R 3.5.1)
## yaml 2.2.0 2018-07-25 [1] CRAN (R 3.5.1)
##
## [1] /home/ubuntu/R/x86_64-pc-linux-gnu-library/3.5
## [2] /usr/local/lib/R/site-library
## [3] /usr/lib/R/site-library
## [4] /usr/lib/R/library